Simple and Effective Parameter Tuning for Domain Adaptation of Statistical Machine Translation
Authors
Abstract
Current state-of-the-art Statistical Machine Translation systems are based on log-linear models that combine a set of feature functions to score translation hypotheses during decoding. The models are parametrized by a vector of weights, usually optimized on a set of sentences and their reference translations called development data. In this paper, we explore a common and industry-relevant scenario in which a system trained and tuned on general-domain data needs to be adapted to a specific domain for which no or only very limited in-domain bilingual data is available. It turns out that systems can be adapted successfully by re-tuning model parameters using surprisingly small amounts of parallel in-domain data, by cross-tuning, or with no tuning at all. We show in detail how and why this is effective, and compare the approaches and the effort involved. We also study the effect of hyperparameters (such as maximum phrase length and development data size) and their optimal values in this scenario.
Title and abstract in Czech
Jednoduchá a efektivní optimalizace parametrů pro doménovou adaptaci statistického strojového překladu
Současné systémy statistického strojového překladu jsou založeny na logaritmicko-lineárních modelech, které ve fázi dekódování kombinují příznakové funkce pro hodnocení překladových hypotéz. Tyto modely jsou parametrizovány vektorem vah, které se optimalizují na tzv. vývojových datech, což je množina vět a jejich referenčních překladů. V článku se zabýváme (častou a pro průmyslové nasazení relevantní) situací, kdy je třeba překladový systém natrénovaný na datech z obecné domény adaptovat na nějakou specifickou doménu, pro kterou jsou k dispozici paralelní data jen ve velice omezeném (či žádném) množství. Ukazujeme, že takové systémy mohou být vhodně adaptovány pomocí optimalizace parametrů, a to za použití jen překvapivě malého množství paralelních doménově-specifických dat, či tzv. křížovou optimalizací, nebo bez použití optimalizace vůbec. Toto pozorování důkladně analyzujeme, porovnáváme použité přístupy a jejich celkovou náročnost. Dále se zabýváme analýzou hyperparametrů (např. maximální délkou frází a velikostí vývojových dat) a jejich optimalizací.
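As a rough illustration of the log-linear scoring the abstract refers to, the sketch below ranks two candidate translations by a weighted sum of feature values and shows how a different weight vector can change which candidate wins. The feature names (`tm`, `lm`), the feature values, and both weight vectors are hypothetical and are not taken from the paper; real systems use many more features and tune the weights on development data.

```python
from typing import Dict, List


def loglinear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Log-linear model score: the weighted sum of feature function values."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(nbest: List[Dict[str, float]], weights: Dict[str, float]) -> int:
    """Return the index of the highest-scoring hypothesis in an n-best list."""
    return max(range(len(nbest)), key=lambda i: loglinear_score(nbest[i], weights))


# Hypothetical feature values (log-probabilities) for two candidate translations.
nbest = [
    {"tm": -2.0, "lm": -6.0},  # preferred by the translation model
    {"tm": -4.0, "lm": -3.0},  # preferred by the language model
]

# Hypothetical weight vectors: one from general-domain tuning, one re-tuned.
general_weights = {"tm": 1.0, "lm": 0.2}
adapted_weights = {"tm": 0.3, "lm": 1.0}

print(best_hypothesis(nbest, general_weights))
print(best_hypothesis(nbest, adapted_weights))
```

With these made-up numbers the two weight vectors select different candidates, which is the sense in which re-tuning the weights alone can change system output without retraining the underlying models.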
Similar resources
Improved Domain Adaptation for Statistical Machine Translation
We present a simple and effective infrastructure for domain adaptation for statistical machine translation (MT). To build MT systems for different domains, it trains, tunes and deploys a single translation system that is capable of producing adapted domain translations and preserving the original generic accuracy at the same time. The approach unifies automatic domain detection and domain model...
Domain Adaptation of Statistical Machine Translation using Web-Crawled Resources: A Case Study
In this research, we tackle the problem of domain adaptation of Statistical Machine Translation by exploiting domain-specific data acquired by domain-focused web-crawling. We design and empirically evaluate a procedure for automatic acquisition of both monolingual and parallel data and their exploitation for system training, tuning, and testing in a phrase-based Statistical Machine Translation f...
Towards Using Web-Crawled Data for Domain Adaptation in Statistical Machine Translation
This paper reports on the ongoing work focused on domain adaptation of statistical machine translation using domain-specific data obtained by domain-focused web crawling. We present a strategy for crawling monolingual and parallel data and their exploitation for testing, language modelling, and system tuning in a phrase-based machine translation framework. The proposed approach is evaluated on ...
Fine-Tuning for Neural Machine Translation with Limited Degradation across In- and Out-of-Domain Data
Neural machine translation is a recently proposed approach which has shown results competitive with traditional MT approaches. Similar to other neural-network-based methods, NMT also suffers from low performance for the domains with less available training data. Domain adaptation deals with improving performance of a model trained on large general domain data over test instances from a new domain...
Automatic Tune Set Generation for Machine Translation with Limited In-domain Data
Many effective adaptation techniques for statistical machine translation crucially rely on in-domain development sets to learn model parameters. In this paper we present a novel method that automatically generates the matching tune set for Arabic-to-English MT with limited in-domain data. This technique improves our MT system over two baselines (tuned on data from the same domain but differen...